An External Memory Approach to Compute the Statistics of Maximal Repeats from Whole Genome Sequences
Abstract
The objective of this paper is to develop an external memory approach for extracting the maximal repeats from whole genome sequences, together with the statistics of these repeats across classes, where the definition of a class depends on the kind of statistics one wants to compute. We propose a heuristic method consisting of a bucket-sort-like approach and the Chinese term extraction approach. The former is used to sort the suffixes of DNA sequences stored in files, and the latter is used to extract maximal repeats by scanning the sorted suffixes while computing the statistics of these repeats. The statistics of these repeats across classes might be useful for sequence classification and species identification.
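The two stages named in the abstract can be illustrated with a small in-memory sketch. It is not the paper's implementation: the abstract does not give the details of the bucketing scheme or of the Chinese term extraction approach, so the code below substitutes a plain scan of adjacent sorted suffixes plus a left-extension check, and it assumes the usual definition of a maximal repeat (a substring occurring at least twice that cannot be extended left or right without losing occurrences). The function names, the prefix length `k`, the `min_len` threshold, and the use of in-memory lists in place of bucket files are all assumptions made for illustration.

```python
from collections import defaultdict

def bucket_sort_suffixes(seqs, k=2):
    """Sort all suffixes of all sequences; returns a list of (seq_id, pos)."""
    # Distribute suffixes into buckets keyed by their first k characters.
    # In an external memory setting each bucket would be a file on disk
    # (assumption); here it is a list.
    buckets = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for pos in range(len(seq)):
            buckets[seq[pos:pos + k]].append((sid, pos))
    # Sort each bucket on its own, then concatenate in key order.
    ordered = []
    for key in sorted(buckets):
        ordered.extend(sorted(buckets[key], key=lambda s: seqs[s[0]][s[1]:]))
    return ordered

def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def maximal_repeat_stats(labeled_seqs, min_len=2, k=2):
    """Map each maximal repeat to its occurrence count per class label."""
    labels = [label for label, _ in labeled_seqs]
    seqs = [seq for _, seq in labeled_seqs]
    order = bucket_sort_suffixes(seqs, k)

    # The common prefix of two adjacent sorted suffixes is a repeat that
    # cannot be extended to the right without losing an occurrence.
    candidates = set()
    for a, b in zip(order, order[1:]):
        sa, sb = seqs[a[0]][a[1]:], seqs[b[0]][b[1]:]
        n = common_prefix_len(sa, sb)
        if n >= min_len:
            candidates.add(sa[:n])

    stats = {}
    for rep in candidates:
        counts = defaultdict(int)
        left_context = set()
        for sid, pos in order:
            if seqs[sid].startswith(rep, pos):
                counts[labels[sid]] += 1
                # A sequence start counts as its own distinct left context.
                left_context.add(seqs[sid][pos - 1] if pos > 0 else ("start", sid))
        # Keep the repeat only if it cannot be extended to the left either.
        if len(left_context) >= 2:
            stats[rep] = dict(counts)
    return stats
```

For example, `maximal_repeat_stats([("A", "ACGACG"), ("B", "TACGT")])` returns `{"ACG": {"A": 2, "B": 1}}`: `ACG` occurs twice in class A and once in class B, while `CG` is dropped because every occurrence is preceded by `C`. The per-candidate counting loop is quadratic and only suitable for toy inputs; a disk-based version would accumulate counts during the scan of the sorted suffixes, as the abstract describes.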
Similar resources
RepMaestro: scalable repeat detection on disk-based genome sequences
MOTIVATION: We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk re...
Exhaustive Computation of Exact duplications via Super and Non-Nested Local Maximal repeats
We propose and implement a method to obtain all duplicated sequences (repeats) from a chromosome or whole genome. Unlike existing approaches, our method makes it possible to simultaneously identify and classify repeats into super, local, and non-nested local maximal repeats. Computation verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reason...
Searching the genome of beluga (Huso huso) for sex markers based on targeted Bulked Segregant Analysis (BSA)
In sturgeon aquaculture, where the main purpose is caviar production, a reliable method is needed to separate fish according to gender. Currently, due to the lack of external sexual dimorphism, the fish are sexed by an invasive surgical examination of the gonads. Development of a non-invasive procedure for sexing fish based on genetic markers is of special interest. In the present study we empl...
Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome
MOTIVATION: There is significant ongoing research to identify the number and types of repetitive DNA sequences. As more genomes are sequenced, efficiency and scalability in computational tools become mandatory. Existing tools fail to find distant repeats because they can accommodate only segments, not whole chromosomes. Also, a quantitative framework for repetitive elements inside a genome or ac...
Profile of Eight Prophage Sequences Present in the Genomes of Different Acinetobacter baumannii Strains
ABSTRACT Background and Objective: Prophage sequences are major contributors to interstrain variations within the same bacterial species. Acinetobacter baumannii is a gram-negative bacterium that causes a wide range of nosocomial infections, especially in intensive care unit inpatients. Prophage sequences constitute a considerable proporti...
Publication date: 2005